Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Hu, Rui, Chen, Yu, Huang, Longbo
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions (through techniques such as reward shaping, entropy regularization, or curriculum learning), yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our results show that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with a bounded gradient on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2 T$ in the static-reward case.
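To make the setting concrete, here is a minimal numpy sketch of a single-timescale actor-critic loop in which the reward parameter evolves alongside the actor and critic. Everything here is assumed for illustration: the toy 3-state MDP, the tanh-shaped reward term, and the clipped TD-error signal driving the reward update (the abstract only requires a gradient-based rule with a bounded gradient); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state, 2-action MDP with random transitions (assumed for illustration).
S, A, T = 3, 2, 5000
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
base_r = rng.uniform(0.0, 1.0, size=(S, A))  # static base reward

def reward(s, a, eta):
    # Evolving reward: base reward plus a parameterized shaping term
    # (hypothetical form; bounded and smooth in eta).
    return base_r[s, a] + 0.1 * np.tanh(eta)

theta = np.zeros((S, A))                 # actor: softmax policy logits
w = np.zeros(S)                          # critic: tabular value estimates
eta = 0.0                                # reward parameter, same timescale as actor/critic
alpha = beta = zeta = 1.0 / np.sqrt(T)   # single-timescale O(1/sqrt(T)) step sizes
gamma = 0.9                              # discount factor

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(T):
    pi = policy(s)
    a = rng.choice(A, p=pi)
    s_next = rng.choice(S, p=P[s, a])
    r = reward(s, a, eta)

    # Critic: TD(0) update of the value estimate under the *current* reward.
    delta = r + gamma * w[s_next] - w[s]
    w[s] += beta * delta

    # Actor: policy-gradient step, using the TD error as the advantage signal.
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] += alpha * delta * grad_log

    # Reward parameter: gradient-based rule with a bounded update signal
    # (clipping stands in for the bounded-gradient assumption).
    eta += zeta * np.clip(delta, -1.0, 1.0)

    s = s_next  # Markovian sampling: the chain is never reset between updates
```

Note how all three step sizes share the same $1/\sqrt{T}$ scale; under the abstract's slow-evolution condition this is exactly the regime in which the $O(1/\sqrt{T})$ rate is claimed to survive.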
Strategyproof Reinforcement Learning from Human Feedback
Buening, Thomas Kleine, Gan, Jiarui, Mandal, Debmalya, Kwiatkowska, Marta
We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof: they can learn a substantially misaligned policy even when only one out of $k$ individuals reports their preferences strategically. We further find that any strategyproof RLHF algorithm must perform $k$ times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.
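A hedged sketch of the pessimistic-median idea: each individual's (possibly strategic) report is first penalized by an uncertainty width, and each candidate policy is scored by the median across individuals, so a single manipulator has limited influence. The toy setup below (noisy scalar value reports, a fixed confidence width, one strategic reporter) is invented for illustration and is not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: k individuals report scalar values for a handful of candidate
# policies; individual 0 mis-reports to push policy 0 (all numbers invented).
k, n_policies, n_samples = 9, 5, 200
true_values = rng.uniform(0.0, 1.0, size=n_policies)

reports = true_values + rng.normal(0.0, 0.1, size=(k, n_policies))  # honest noise
reports[0] = 0.0      # the strategic individual buries every other policy...
reports[0, 0] = 1.0   # ...and inflates their preferred one

# Pessimism: penalize each estimate by an uncertainty width. A fixed
# 1/sqrt(n_samples) width stands in for a coverage-based confidence bound.
width = 0.3 / np.sqrt(n_samples)
pessimistic = reports - width

# Median across individuals: one strategic report can barely move the median,
# which is the intuition behind approximate strategyproofness.
scores = np.median(pessimistic, axis=0)
print("chosen policy:", scores.argmax(), "| true best:", true_values.argmax())
```

With a mean instead of a median, the single strategic row would shift every score by up to $1/k$ of its distortion; the median caps that influence, which is the incentive-alignment side of the trade-off the abstract describes.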
Hypernetwork-based approach for optimal composition design in partially controlled multi-agent systems
Park, Kyeonghyeon, Concha, David Molina, Lee, Hyun-Rok, Lee, Chi-Guhn, Lee, Taesik
Partially Controlled Multi-Agent Systems (PCMAS) consist of controllable agents, managed by a system designer, and uncontrollable agents, which operate autonomously. This study addresses the optimal composition design problem in PCMAS, which couples the system designer's problem (determining the optimal number and policies of controllable agents) with the uncontrollable agents' problem (identifying their best-response policies). Solving this bi-level optimization problem is computationally intensive, as it requires repeatedly solving multi-agent reinforcement learning problems under various compositions for both types of agents. To address these challenges, we propose a novel hypernetwork-based framework that jointly optimizes the system's composition and agent policies. Unlike traditional methods that train separate policy networks for each composition, the proposed framework generates policies for both controllable and uncontrollable agents through a unified hypernetwork. This approach enables efficient information sharing across similar configurations, thereby reducing computational overhead. Additional improvements are achieved by incorporating reward parameter optimization and mean action networks. Using real-world New York City taxi data, we demonstrate that our framework outperforms existing methods in approximating equilibrium policies. Our experimental results show significant improvements in key performance metrics, such as order response rate and served demand, highlighting the practical utility of controlling agents and their potential to enhance decision-making in PCMAS.
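A minimal sketch of the unified-hypernetwork idea: a single shared parameter map turns a composition descriptor (here, just the fraction of controllable agents and the population size, both assumed features) into the weights of a small per-agent policy network, so policies for different compositions share parameters instead of being trained separately. The linear architecture and hand-picked embedding are illustrative placeholders, far simpler than the paper's framework.

```python
import numpy as np

rng = np.random.default_rng(2)

# Shapes of a small per-agent policy network: obs -> hidden -> action logits.
obs_dim, hidden, n_actions = 8, 16, 4
n_params = obs_dim * hidden + hidden + hidden * n_actions + n_actions

# Hypernetwork: a single linear map from a composition embedding to the flat
# parameter vector of the policy. Both the embedding features and the linear
# form are illustrative assumptions.
emb_dim = 4
H = rng.normal(0.0, 0.1, size=(emb_dim, n_params))

def embed_composition(n_controllable, n_total):
    frac = n_controllable / n_total
    return np.array([1.0, frac, frac ** 2, np.log(n_total)])

def policy_from_hypernet(z):
    flat = z @ H  # one shared H generates weights for every composition
    i = 0
    W1 = flat[i:i + obs_dim * hidden].reshape(obs_dim, hidden); i += obs_dim * hidden
    b1 = flat[i:i + hidden]; i += hidden
    W2 = flat[i:i + hidden * n_actions].reshape(hidden, n_actions); i += hidden * n_actions
    b2 = flat[i:]

    def act(obs):
        h = np.tanh(obs @ W1 + b1)
        logits = h @ W2 + b2
        p = np.exp(logits - logits.max())
        return p / p.sum()

    return act

# The same shared parameters H serve very different compositions:
pi_small = policy_from_hypernet(embed_composition(10, 50))
pi_large = policy_from_hypernet(embed_composition(40, 50))
print(pi_small(rng.normal(size=obs_dim)), pi_large(rng.normal(size=obs_dim)))
```

Because every composition's policy is read off from the shared map H, gradient signal from one configuration transfers to similar ones, which is the information-sharing effect the abstract credits with reducing computational overhead.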